We present an implementation of the disconnected diagram contributions to quantities such as the
flavor-singlet pseudoscalar meson mass, accelerated by GPGPU technology using
the NVIDIA CUDA platform. To enable the exact evaluation of the disconnected loops we use a
$16^3 \times 32$ lattice and $N_f = 2$ Wilson fermions simulated by the SESAM Collaboration. The
disconnected loops are also computed using stochastic methods with several noise reduction techniques.
In particular, we analyze various dilution schemes as well as the recently proposed truncated
solver method. We find consistency among the different methods used for the determination of
the $\eta'$ mass, although the gauge noise for the ensemble studied is large. We also find that the
effect of 'dilution' does not go beyond that of optimal statistical noise in many cases. It has been
observed, however, that spin dilution does have a significant effect for some quantities studied.
1. Introduction

An accurate estimate of disconnected contributions to flavor-singlet quantities remains one of
the most computationally demanding problems in hadronic physics. The most commonly adopted
approach is to apply stochastic methods in order to estimate the quark propagator. A number of
methods to reduce the stochastic noise inherent in such an approach have been developed and their
respective merits investigated in detail in Ref. [1]. Such methods typically require large numbers
of Dirac matrix inversions, and hardware accelerators such as graphics processors (GPUs) can
dramatically accelerate these inversions [2].

The main goal of the present study is two-fold. Firstly, we compute the disconnected contribution
to the mass of the flavor-singlet pseudoscalar meson, $\eta'$, which is also related to the $U_A(1)$
anomaly in QCD. Here, this is used as a case study for evaluating the efficacy of the implementation.
Secondly, we examine the efficiency of various stochastic noise reduction techniques.
More precisely, at this stage, we consider two techniques of variance reduction: partitioning (or
dilution) [3] and the truncated solver method [4]. We performed an exact evaluation of the
disconnected loops for $N_f = 2$ Wilson fermions on a lattice of size $16^3 \times 32$ using GPUs.
The calculation is then repeated using stochastic methods. The exact calculation gives us an accurate
benchmark against which to compare all stochastic variance reduction methods and explicitly exposes
the gauge noise underlying each quantity to be measured.

2. Lattice ensemble and simulation parameters 

For this exploratory study we use $N_f = 2$ Wilson fermions at $\beta = 5.6$ and hopping parameter
$\kappa = 0.157$, which corresponds to a pion mass of $m_\pi = 884$ MeV on a lattice of size
$16^3 \times 32$ [5]. The lattice spacing is $a = 0.08$ fm as determined from the nucleon mass at
the physical point [6]. For constructing the meson propagators we utilized both local and smeared
quark fields. In the latter case, we apply gauge-covariant Gaussian smearing using a range of
smearing parameters.

The stochastic estimate of the disconnected quark loops is performed using complex Z2 noise 
for the source vectors in combination with several partitioning (dilution) schemes and the truncated 
solver method [1]. Specifically, we consider various combinations of space, spin and color dilution 
schemes. Colour dilution leads to a multiplicative factor of 3 for the number of inversions. In spin 
space, a full dilution leads to a factor of 4 for the inversions. In this case an even-odd partitioning 
of the space can alternatively be employed leading to an increase of a factor of 2 in the number of 
inversions. For spatial dilutions, in addition to an even-odd dilution, we have also applied a cubic 
dilution, where separate sources are placed on each vertex of an elementary 3-d cube and repeated 
throughout the lattice, leading to an increase of a factor of 8 in the number of inversions. Time 
dilution is applied in all cases and translational invariance exploited so that this does not increase 
the number of required inversions. 
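
The dilution idea can be illustrated on a toy problem. The following is a minimal sketch, not the implementation used in this work: a small dense random matrix `M` stands in for the Dirac operator, `np.linalg.solve` stands in for the GPU inverter, and the names `z2_noise` and `diluted_trace_estimate` are hypothetical. It shows that each noise vector costs one solve per dilution block, and that full dilution removes the stochastic error entirely.

```python
import numpy as np

def z2_noise(n, rng):
    """Complex Z2 noise: entries (+-1 +- i)/sqrt(2), so |eta_i|^2 = 1 exactly."""
    re = rng.choice([-1.0, 1.0], size=n)
    im = rng.choice([-1.0, 1.0], size=n)
    return (re + 1j * im) / np.sqrt(2.0)

def diluted_trace_estimate(M, partition, n_noise, rng):
    """Stochastic estimate of Tr(M^{-1}) with diluted noise vectors.

    partition: integer array assigning every index to one dilution block.
    Cost: n_noise * (number of blocks) linear solves.
    """
    n = M.shape[0]
    blocks = np.unique(partition)
    total = 0.0 + 0.0j
    for _ in range(n_noise):
        eta = z2_noise(n, rng)
        for b in blocks:
            # Keep the noise only on the components belonging to block b.
            eta_b = np.where(partition == b, eta, 0.0)
            psi = np.linalg.solve(M, eta_b)   # stand-in for one inversion
            total += np.vdot(eta_b, psi)      # eta_b^dagger M^{-1} eta_b
    return total / n_noise

# Toy setup (assumed): a well-conditioned random complex matrix.
rng = np.random.default_rng(0)
n = 24
M = 4.0 * np.eye(n) + 0.2 * (rng.standard_normal((n, n))
                             + 1j * rng.standard_normal((n, n)))
exact = np.trace(np.linalg.inv(M))

even_odd = np.arange(n) % 2   # two blocks: 2 solves per noise vector
full = np.arange(n)           # one block per index: n solves per noise vector

est_even_odd = diluted_trace_estimate(M, even_odd, 50, rng)
est_full = diluted_trace_estimate(M, full, 1, rng)
```

With one block per index the estimate coincides with the exact trace, because the only surviving terms are the diagonal ones and the Z2 entries have unit modulus. Coarser partitions trade solves against residual off-diagonal noise.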

The truncated solver method [1] effectively partitions the problem into a low-precision and a
high-precision space. A large number of low-precision inversions are carried out to achieve an
approximation to the propagator with low stochastic error (but only accurate to low precision). A
high-precision stochastic correction is then applied using a small ensemble, with the corresponding
inversions carried out to high precision. We use a stochastic ensemble of 5000 noise vectors for the
low-precision space with an ensemble of 500 noise vectors for the high-precision correction. The
inversion tolerance for the low precision was chosen to be 10~^ such that one can restrict oneself to 
a single precision conjugate gradient inversion (which is very efficient on GPU accelerators), while 
in the case of full precision the tolerance was set to 10^^°. The ensemble sizes were chosen to be 
quite large in order to avoid any quantity-specific tuning of the ensemble sizes. 
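
The two-level structure of the truncated solver method can be sketched on a toy problem. This is an illustration under stated assumptions, not the code used in this work: the Hermitian positive-definite matrix `A` stands in for the operator being inverted, `cg_solve` is a plain textbook conjugate gradient, the ensemble sizes are reduced to 1000 and 100, and the tolerances `1e-2` and `1e-10` are illustrative choices. All function names are hypothetical.

```python
import numpy as np

def z2_noise(n, rng):
    """Complex Z2 noise with unit-modulus entries."""
    return (rng.choice([-1.0, 1.0], size=n)
            + 1j * rng.choice([-1.0, 1.0], size=n)) / np.sqrt(2.0)

def cg_solve(A, b, tol, max_iter=1000):
    """Conjugate gradient for Hermitian positive-definite A.

    Stops when the relative residual |r| / |b| drops below tol.
    """
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = np.vdot(r, r).real
    b_norm2 = np.vdot(b, b).real
    for _ in range(max_iter):
        if rr <= tol**2 * b_norm2:
            break
        Ap = A @ p
        alpha = rr / np.vdot(p, Ap).real
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = np.vdot(r, r).real
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def tsm_trace(A, n_lp, n_hp, tol_lp, tol_hp, rng):
    """Truncated-solver estimate of Tr(A^{-1}).

    Many cheap low-precision solves give the bulk of the estimate; a small
    ensemble of (high - low) differences corrects the truncation bias.
    """
    n = A.shape[0]
    low = 0.0 + 0.0j
    for _ in range(n_lp):
        eta = z2_noise(n, rng)
        low += np.vdot(eta, cg_solve(A, eta, tol_lp))
    corr = 0.0 + 0.0j
    for _ in range(n_hp):
        eta = z2_noise(n, rng)
        diff = cg_solve(A, eta, tol_hp) - cg_solve(A, eta, tol_lp)
        corr += np.vdot(eta, diff)
    return low / n_lp + corr / n_hp

# Toy setup (assumed): a diagonally dominant Hermitian matrix.
rng = np.random.default_rng(1)
n = 32
G = 0.1 * (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
A = 3.0 * np.eye(n) + 0.5 * (G + G.conj().T)
exact = np.trace(np.linalg.inv(A)).real

estimate = tsm_trace(A, n_lp=1000, n_hp=100, tol_lp=1e-2, tol_hp=1e-10, rng=rng)

# Sanity check of the solver against a direct solve.
b = z2_noise(n, rng)
x_cg = cg_solve(A, b, 1e-12)
x_direct = np.linalg.solve(A, b)
```

The correction term is small whenever the low-precision solve is already close to the true solution, which is why a much smaller high-precision ensemble suffices.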

Finally, as was already mentioned, the exact evaluation of the all-to-all propagator is also
carried out. This is clearly the most computationally intensive part, and was only possible due to
the use of graphics accelerators employing the QUDA library (as was used for all inversions),
which provides mixed-precision implementations of CG and BiCGstab solvers for the NVIDIA
CUDA platform [7]. This provides a benchmark at the level of gauge noise for all quantities with
contributions from disconnected loops.

3. Results 

For all-to-all propagators, a general isovector two-point correlation function, $C^{AB}(\mathbf{p},\Delta t)$, for
the creation of a particle at timeslice $t$ with momentum $\mathbf{p}$ from the operator $\Gamma^A$ and its annihilation
at timeslice $t + \Delta t$ with the operator $\Gamma^B$ is given by

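$$C^{AB}(\mathbf{p},\Delta t) = -\sum \left\langle \mathrm{Tr}\left( S_F(\mathbf{y},t;\mathbf{x},t+\Delta t)\,\Gamma^{A}(\mathbf{p})\, S_F(\mathbf{x},t+\Delta t;\mathbf{y},t)\,\Gamma^{B\dagger}(\mathbf{p}) \right) \right\rangle, \qquad (3.1)$$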

where $S_F(\mathbf{x},t;\mathbf{x}',t')$ is the propagator from spacetime point $(\mathbf{x},t)$ to spacetime point $(\mathbf{x}',t')$, spin
and colour indices are suppressed, and phases for momentum projections (and quark smearing
operations) are incorporated into the definition of the operators $\Gamma^A$ and $\Gamma^B$. For isoscalar quantities,
disconnected loops give a contribution $D^{AB}(\mathbf{p},\Delta t)$ to the correlation function,

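$$D^{AB}(\mathbf{p},\Delta t) = \sum \left\langle \mathrm{Tr}\left( S_F(\mathbf{x},t;\mathbf{x},t)\,\Gamma^{A}(\mathbf{p}) \right) \mathrm{Tr}\left( S_F(\mathbf{y},t+\Delta t;\mathbf{y},t+\Delta t)\,\Gamma^{B}(\mathbf{p}) \right) \right\rangle. \qquad (3.2)$$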

In our particular case of the $\eta'$ meson in an $N_f = 2$ gauge ensemble, if we suppress all operator and
momentum indices,

$$C_{\eta'}(t) = C_{\pi}(t) - 2D(t). \qquad (3.3)$$

For mesons on lattices with periodic boundary conditions, $C(t) \sim e^{-mt} + e^{-m(T-t)}$ for $t$ large
(where $T$ is the lattice temporal extent). We can therefore analyze the ratio of the disconnected
quark loop, $D(t)$, and connected correlation function, $C_\pi(t)$, to extract the flavour-singlet
pseudoscalar meson mass,

$$\frac{D(t)}{C_\pi(t)} = A - B\,\frac{e^{-m_{\eta'} t} + e^{-m_{\eta'}(T-t)}}{e^{-m_\pi t} + e^{-m_\pi (T-t)}},$$

where $m_\pi$ and $m_{\eta'}$ are the masses of the $\pi$ and $\eta'$ mesons and $A$, $B$ are additional fit parameters.
$m_\pi$ can be determined separately to the 1% level and inserted as a prior, leaving only 3 parameters
in the fit function. This approach also accounts for independent smearing of the connected and
disconnected loops.
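
As an illustration of how such a ratio behaves, the sketch below assumes the ratio is modelled as a constant minus a ratio of periodic exponentials with parameters $A$, $B$, $m_{\eta'}$ and a fixed $m_\pi$. The function names `ratio_model` and `effective_mass_gap` and the values of `A` and `B` are hypothetical; the masses and temporal extent are taken from this study. Well away from the middle of the lattice, the logarithmic slope of $A - D(t)/C_\pi(t)$ approaches the mass difference $m_{\eta'} - m_\pi$.

```python
import numpy as np

def ratio_model(t, A, B, m_eta, m_pi, T):
    """Assumed model for D(t)/C_pi(t) on a periodic lattice of extent T."""
    t = np.asarray(t, dtype=float)
    num = np.exp(-m_eta * t) + np.exp(-m_eta * (T - t))
    den = np.exp(-m_pi * t) + np.exp(-m_pi * (T - t))
    return A - B * num / den

def effective_mass_gap(ratio, A):
    """log[(A - R(t)) / (A - R(t+1))]; tends to m_eta - m_pi for t << T/2."""
    diff = A - np.asarray(ratio, dtype=float)
    return np.log(diff[:-1] / diff[1:])

# Illustrative parameters: A and B are made up, the masses are in lattice units.
t = np.arange(2, 6)
R = ratio_model(t, A=0.5, B=0.4, m_eta=0.41, m_pi=0.3454, T=32)
gap = effective_mass_gap(R, A=0.5)
```

In an actual fit, `A`, `B` and `m_eta` would be free parameters and `m_pi` would enter through the prior described above.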

In the case of the exact evaluation, the only source of error is the statistical error
of the gauge ensemble, and we therefore use this fact to assess the results obtained using
stochastic methods. We compare exact results for the ratio $D(t)/C_\pi(t)$ obtained with local quark
field operators with those obtained with smeared quark fields in Fig. 1. Given that $m_\pi$ can be
determined to the 1% level, it is clear that the gauge noise derived from the disconnected loops in
this quantity is large. A naive fit of the data constrained between $t_{\min} = 2a$ and $t_{\max} = 11a$ gives
a value for the mass of $am_{\eta'} = 0.41 \pm 0.04$ and $am_{\eta'} = 0.40 \pm 0.05$ for local and smeared operators,
respectively. Here we used $am_\pi = 0.3454(19)$. On the other hand, fitting in the range $t_{\min} = 3a$ to
$t_{\max} = 11a$ gives $am_{\eta'} = 0.51 \pm 0.07$ and $am_{\eta'} = 0.49 \pm 0.09$, respectively.

Figure 1: Ratio $D(t)/C_\pi(t)$ for the $\eta'$ meson computed using the exact approach (without smearing in red and with
smearing in green).

If we look at the same quantities using the truncated solver method, the corresponding plots in
logarithmic scale (for local and smeared operators) are given in Fig. 2. Fitting in the
range $t_{\min} = 2a$ to $t_{\max} = 10a$ gives $am_{\eta'} = 0.42 \pm 0.07$ for local fields and $am_{\eta'} = 0.42 \pm 0.06$ for
smeared fields. Fitting in the range $t_{\min} = 3a$ to $t_{\max} = 10a$ provides the following
estimates for the pseudoscalar mass: $am_{\eta'} = 0.57 \pm 0.11$ and $am_{\eta'} = 0.51 \pm 0.1$, respectively. These results
are summarized in Table 1. The value of $am_{\eta'} = 0.40(5)$ is relatively consistent with the results
obtained using $N_f = 2$ twisted mass fermions at $m_\pi \sim 500$ MeV [8].

Figure 2: Ratio $D(t)/C_\pi(t)$ for the $\eta'$ meson computed using TSM (without smearing in red and with smearing in
green).

Table 1: The $\eta'$ mass using the exact and the truncated solver method (TSM) for the evaluation of the
disconnected loop.

Method   t_min/a   t_max/a   am_η' (local)   am_η' (smeared)
Exact    2         11        0.41 ± 0.04     0.40 ± 0.05
Exact    3         11        0.51 ± 0.07     0.49 ± 0.09
TSM      2         10        0.42 ± 0.07     0.42 ± 0.06
TSM      3         10        0.57 ± 0.11     0.51 ± 0.1

We have also considered the stochastic estimate of the disconnected diagrams using Z2 noise
and two different approaches. First, we examine the number of noise vectors required to
reach a level of stochastic accuracy consistent with the statistical noise of the gauge-field
ensemble. To this end, we have inspected the trace of the zero-momentum projected disconnected loop,
$\mathrm{Tr}(S_F(x,x)\Gamma)$, for a range of operators $\Gamma$. For example, for the particular case of a dilution approach
with a $\gamma_5$ operator insertion, the left of Fig. 3 shows the dependence of the magnitude of the error in
the trace on the number of noise vectors. This figure shows the number of inversions, $N_{\rm inv}$, required
for each dilution scheme along the x-axis, e.g., full colour dilution with full spin dilution would
require 12 inversions (a factor of 3 for colour and 4 for spin, as described earlier). As a
reference, we also plot the optimal statistical error (behaving as $1/\sqrt{N}$), extrapolated from the first
data point, to show the expected behaviour when increasing the ensemble size. Finally we insert the
gauge-level noise (from the exact calculation) at the point where this gauge error is consistent with
the overall error from the optimal error extrapolation. As one can see, in the case of the $\gamma_5$ operator the
dilution approach is consistent with the optimal statistical error and one needs at least 37 inversions
to achieve gauge noise accuracy. A similar analysis can be done for other disconnected loops:
on the right of Fig. 3 we show results for an identity operator insertion, where the gauge noise can be
reached with just 5 noise vectors and, consequently, dilution can have no beneficial effect for the
measurement. Clearly, the size of the stochastic ensemble required is operator-dependent, as noted
in Ref. [1].
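
The bookkeeping behind these plots can be sketched as follows. This is an illustrative helper, not code from this work: the dictionary of multiplicative factors restates the dilution costs quoted in Section 2, the optimal error is assumed to fall off as the inverse square root of the number of inversions, and all names and the reference values in the usage lines are hypothetical.

```python
import math

# Multiplicative cost of each dilution type, as quoted in Section 2.
DILUTION_FACTORS = {
    "colour": 3,          # full colour dilution
    "spin": 4,            # full spin dilution
    "space_even_odd": 2,  # even-odd spatial dilution
    "space_cubic": 8,     # cubic spatial dilution
}

def inversions_per_noise_vector(scheme):
    """Number of inversions needed per noise vector for a combined scheme."""
    n_inv = 1
    for name in scheme:
        n_inv *= DILUTION_FACTORS[name]
    return n_inv

def optimal_error(n_inv, n_ref, err_ref):
    """Error extrapolated from a reference point assuming 1/sqrt(N) scaling."""
    return err_ref * math.sqrt(n_ref / n_inv)

def inversions_to_reach(target_err, n_ref, err_ref):
    """Smallest number of inversions at which the 1/sqrt(N) curve hits target_err."""
    return math.ceil(n_ref * (err_ref / target_err) ** 2)

# Usage with made-up reference values (error 1.0 measured at 12 inversions).
n_colour_spin = inversions_per_noise_vector(["colour", "spin"])
n_needed = inversions_to_reach(target_err=0.5, n_ref=12, err_ref=1.0)
err_at_48 = optimal_error(48, n_ref=12, err_ref=1.0)
```

A dilution scheme is only worthwhile if its measured error at a given number of inversions falls below the value returned by `optimal_error` for the same cost.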

Figure 3: Magnitude of errors for $\mathrm{Tr}(S_F(x,x)\Gamma)$. Right: for the $\gamma_5$ operator (gauge-level noise achievable
with 37 inversions); Left: for the identity operator (gauge-level noise achievable with 5 inversions).

Figure 4: Magnitude of errors for $\mathrm{Tr}(S_F(x,x)\gamma_1\gamma_3)$. Left: for all dilution schemes; Right: for all dilution
schemes which include full spin dilution (gauge-level noise achievable with 174 inversions).

In Fig. 4, we show similar plots for a $\gamma_1\gamma_3$ operator insertion. On the left we show an identical
plot to those in Fig. 3, with the exception that the gauge noise cannot be reached within a sample
size of 1000 and the gauge noise is simply plotted near this limit for illustrative purposes. On the
right we plot the same data for the specific cases where full spin dilution is used. As can be seen,
spin dilution has, in this case, a dramatic effect that allows gauge-level noise to be reached
within 174 inversions. This effect is likely due to the strong off-diagonal nature of this gamma
combination in this basis and has also been observed in other quantities. Also, we again
observe that dilution behaves consistently with the optimal statistical error.

4. Summary 

We have computed the disconnected contribution to the $\eta'$ meson mass using both exact and
stochastic evaluation. Stability in the fit region has not been observed and the level of noise from
the gauge-field ensemble is large, particularly at large separations. We have also compared the
truncated solver method against the exact approach in our attempt to evaluate the $\eta'$ meson mass
and find results that are consistent, though inconclusive, between the two approaches. This is of course
due to the gauge-field ensemble noise inherent in the quantity for the sample.

We analyzed the efficiency of different noise reduction techniques using the gauge-field ensemble
noise from the exact evaluation as a benchmark. We find that, on these lattices, for the stochastic
methods the gauge noise becomes the dominant source of error already at 37 inversions in the
case of $\mathrm{Tr}(S_F(x,x)\gamma_5)$ and at just 5 inversions in the case of $\mathrm{Tr}(S_F(x,x))$. The statistical error from
using the dilution approach appears to behave similarly to statistical noise in many cases for these
types of operators. In particular cases however, such as $\mathrm{Tr}(S_F(x,x)\gamma_1\gamma_3)$, spin dilution has been
seen to have a significant effect.